This report will analyse bulk RNA sequencing data from the GEO database, accession number: GSE213001, which looks at the lung tissue of people affected and unaffected by idiopathic pulmonary fibrosis (IPF). This data is associated with an article by Jaffar et al. published in 2022 (Jaffar et al., 2022) which investigates a biomarker of IPF disease development.
The data was obtained using the GEOquery Bioconductor package (Davis & Meltzer, 2007). Samples were collected from 20 diseased patients and 14 non-diseased controls. A total of 139 samples were collected. Summary statistics can be seen below for samples per condition (disease/normal) and samples per pateint.
| Tissue type | Frequency |
|---|---|
| Disease | 98 |
| Normal | 41 |
| Samples per patient | Number of patients |
|---|---|
| 1 | 2 |
| 2 | 20 |
| 3 | 1 |
| 4 | 21 |
| 5 | 2 |
Figure 1. Library size of all 139 samples. The mean library size is 13.8 million reads.
This dataset contains 15,065 genes and 139 samples. There are no duplicate genes or samples.
Using Ensembl (Dyer et al., 2025), Ensembl IDs found in the expression data were mapped to HGNC symbols, which are more readable and useful for downstream processing.
| ensembl_gene_id | hgnc_symbol | ALF001 | ALF004C | ALF004D |
|---|---|---|---|---|
| ENSG00000000003 | TSPAN6 | 426 | 340 | 376 |
| ENSG00000000419 | DPM1 | 410 | 289 | 379 |
| ENSG00000000457 | SCYL3 | 226 | 195 | 272 |
| ENSG00000000460 | FIRRM | 56 | 58 | 92 |
| ENSG00000000938 | FGR | 726 | 1889 | 1979 |
| ENSG00000000971 | CFH | 7922 | 5938 | 6026 |
Genes that did not map to a HUGO gene symbol were retained under their Ensembl ID, as they may still show differential expression. There were no duplicate genes, though if there had been, average expression could have be taken.
First, we would like to get an overview of the data by making a boxplot and a density plot of read counts. Counts per million were calculated using EdgeR (Robinson, McCarthy, & Smyth, 2010).
Figure 2. Box plot of counts per million (CPM) of all 139 samples CPM calculated using EdgeR (Robinson et al., 2010). Code adapted from BCB420 Lecture 5.
Some samples have a mean CPM clearly lower than the rest, but with so many samples it is difficult to compare them all. A density plot can help with this visualization.
Figure 3. Density plot of counts per million (CPM) for all 139 samples. CPM calculated using EdgeR (Robinson et al., 2010). Code adapted from BCB420 Lecture 5.
As expected, there is somewhat of a peak near zero counts as many genes are expressed in low quantity. One sample (orange) stands out from the rest, with the CPM curve shifted to the left.
Next, EdgeR was used to filter out genes with low expression levels
(counts < 3).
Figure 4. Box plot of counts per million of all 139 samples after removal of genes which are lowly expressed (< 3 counts).
Again, a density plot was made for visualization, and the drop in low counts is clear:
Figure 5. Density plot of counts per million after removal of genes which are lowly expressed (< 3 counts).
It is important to use a design matrix as it allows filtering to be done in a way that considers our conditions (normal vs diseased tissue). Without one, genes that are differentially expressed between conditions, could be filtered out if they have low counts in one of the two conditions.
Information on whether the sample is from a diseased on healthy person is found in the metadata, and this will be used to make the design matrix.
Figure 6. Box plot of counts per million of all 139 samples after removal of lowly expressed genes and using a design matrix
Once again, a density plot allows us to visualise the change in
distribution:
Figure 7. Density of counts per million of all 139 samples after removal of lowly expressed genes and using a design matrix
Compared to the initial plot, we can see that the number of genes with low counts has decreased. We should expect a similar result after the following normalization steps.
TMM is a sample-based normalization method, based off of the idea that most genes aren’t differentially expressed. It compares expression to a reference sample and is commonly used with RNA-seq data. It finds the log expression ratio (M value) for all genes, comparing to the reference, trims data with extreme M and A (expression) values, and calculates a weighted mean of the remaining ratios (Robinson & Oshlack, 2010).
EdgeR is used to create the DGEList object, which will serve as input for future steps (Robinson et al., 2010).
Figure 8. Box plot of Trimmed Mean of M-values (TMM)-normalized gene expression for all 139 samples. The default minimum count of 10 is used.
After TMM normalization, there are no obvious outliers, and the means are more closely aligned than after manual cleaning (Figure 6).
Figure 9. Box plot of Trimmed Mean of M-values (TMM)-normalized gene expression for all 139 samples. The default minimum count of 10 is used.
Compared to the original data (Figure 3), we can see that low counts have been removed and the curve is overall more compact.
Next, we represented the data in an a multidimensional scaling (MDS) plot to visualize the similarities and differences between samples.
Figure 10. Multidimensional scaling (MDS) plot of TMM normalized
data Red circles represent healthy donor samples, and black
triangles represent diseased donor samples. Points are positioned based
on similarity.
By inspection, there is not a clear separation between diseased and normal tissues. With all points being quite spread out, there are no obvious outliers.
First, a biological coefficient of varience (BCV) plot was created to show how gene expression varies between samples. This plot can be used to verify the dispersion of genes (BCV is the square root of dispersion).
Figure 11. Biological coefficient of variance (BCV) plot Common BCV plotted in red and trend BCV plotted in blue.
Common BCV is approximately 50%, while the trend BCV, which accounts for expression level, decreases from ~70% in less expressed genes to ~40% in higher expressed genes. This decrease is expected as genes with low count tend to contain more noise. Overall, the points remain relatively close to the trendline, which indicates successful dispersion estimation.
Some points have very high dispersion (maximum of 7.55), which could be due to each of the 139 samples being treated independently. Of the three most widely used differential expression analysis tools (limma, EdgeR and DESeq2), which perform similarly (Liu et al., 2021), limma (Ritchie et al., 2015) has the advantage of the duplicateCorrelation function, which models within-patient correlation (takes into account multiple samples coming from one individual). This is important as many patients contributed multiple samples to this data set, and treating them as indepent could inflate variance measurements. Voom was used to transform data from EdgeR so that it can be used by limma (Law, Chen, Shi, & Smyth, 2014).
The consensus correlation is 0.322, which indicates that there is a moderate correlation between samples from a single patient. A threshold of 0.05 was chosen for both p-value and FDR/multiple hypothesis correction as this is the standard cutoff used. After differential expression analysis, 8,192 genes pass the p < 0.05 threshold and 7,182 pass FDR correction (at most 5% of significant genes are expected to be false positives) out of the total 15,065 genes. Next, we can look at the top genes which are the most differentially expressed.
kable(head(results), caption = "Table 4. Preview of top differentially expressed genes", digits = 50, col.names = c("HGNC symbol", "logFC", "Average Expression", "p-value", "FDR corrected p-value"))
| HGNC symbol | logFC | Average Expression | p-value | FDR corrected p-value |
|---|---|---|---|---|
| LDLRAD4 | 1.295029 | 5.336294 | 1.743361e-31 | 2.618354e-27 |
| NECAB1 | -3.453363 | 0.965507 | 2.762368e-29 | 2.074400e-25 |
| LTBP1 | 1.690025 | 7.753609 | 3.765045e-27 | 1.884907e-23 |
| TMEM97 | -1.669245 | 3.762740 | 6.396247e-27 | 2.401631e-23 |
| CFH | 1.825359 | 8.729778 | 2.564183e-26 | 7.702294e-23 |
| ACSS2 | -1.206457 | 6.137581 | 2.594291e-25 | 6.493942e-22 |
Figure 12. Volcano plot representing the differential expression of genes in IPF. Red points represent genes which are significantly upregulated in diseased tissue (FDR < 0.05, logFC > 0), blue points represent genes that are significantly downregulated in diseased tissue (FDR < 0.05, logFC < 0), and grey points represent genes that are not significantly differentially expressed.
| Gene | logFC | FDR-corrected p-value |
|---|---|---|
| LDLRAD4 | 1.295029 | 2.618354e-27 |
| NECAB1 | -3.453363 | 2.074400e-25 |
| LTBP1 | 1.690025 | 1.884907e-23 |
| TMEM97 | -1.669245 | 2.401631e-23 |
| CFH | 1.825359 | 7.702294e-23 |
| ACSS2 | -1.206457 | 6.493942e-22 |
Finally, a heatmap was used to visualize gene expression in diseased tissues. Genes that are upregulated or downregulated in diseased tissues could be important to understanding IPF development.
Figure 13. Heatmap of top 20 differentially expressed genes based on FDR-corrected p-value. Comparing diseased tissues to normal tissues. Rows represent the top 20 genes and columns represent the 139 samples.Red represents positive z-scores (upregulation), and blue represents negative z-scores (downregulation).
There is clear clustering observed where genes are upregulated in one condition (diseased/normal tissue) and downregulated in the other.